Abstract. Online forums facilitate knowledge seeking and sharing on
the Web. However, the shared knowledge is not fully utilized due to
information overload. Thread retrieval is one method to overcome
information overload. In this paper, we propose a model that combines
two existing approaches: the Pseudo Cluster Selection and the Voting
Techniques. In both, a retrieval system first scores a list of messages
and then ranks threads by aggregating their scored messages. They
differ on what and how to aggregate. The pseudo cluster selection focuses
on the input, while voting techniques focus on the aggregation method. Our
combined models focus on both the input and the aggregation method. The
results show that some combined models are statistically superior to
baseline methods.

1 Introduction

Online forums are user-generated content platforms that enable users to build
virtual communities. In these communities, users interact with each other to seek
and share knowledge. The interaction happens through exchanging information
in the form of discussions. A user starts a discussion by posting an initial message.
Other users then contribute replies to it. An initial message together with
the set of its reply messages forms a thread.

Thread retrieval helps forums' end users to find information satisfying their 
needs. However, the challenge is that threads are not text; they are collections of 
"messages" — the initial and the reply messages. Therefore, given a user query, 
a retrieval system needs to utilize the text of the messages to rank threads. 
The problem of the thread retrieval is similar to the problem of the blog site 
retrieval [1116 . In blog site retrieval, given a query, we leverage the blogs' postings 
in order to rank blogs. An analogy between these two retrieval problems is that 
threads are the blogs, and messages are the postings. 

Because thread retrieval resembles blog site retrieval, researchers, as in [5]
and [10], have adapted techniques from blog distillation, such as [4, 11], to
thread search, and the adapted models performed well. Motivated
by this fact, we propose to combine two known blog site retrieval techniques:
Pseudo Cluster Selection (PCS) [11] and Voting Techniques (VT) [6]. We then
apply the combined model to thread search.

Both methods first rank a list of postings and then aggregate the postings' 
scores to rank their parent blogs. However, they differ in two aspects: the input 
and the aggregation method. PCS focuses on the top k ranked postings from 
each blog, whereas VT considers all ranked postings. Furthermore, PCS uses the 
geometric mean to fuse scores, while VT adapts various data fusion methods, 
such as CombMAX [12] and expCombSUM [9], to the blog distillation task. In
using CombMAX, blogs are ranked based on their best scoring posting; in using 
expCombSUM, a blog is scored based on the sum of the exponential values of the 
blog's ranked postings' relevance scores. It was reported that the voting methods 
that favor blogs with highly ranked postings performed better than the other 
voting methods [6]. This characteristic is similar to PCS's emphasis on highly
ranked postings because it considers only the top k postings. 
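
As an illustration of the two voting methods just described, the following
minimal Python sketch scores two blogs from hypothetical posting scores. The
function names and the example scores are assumptions made for this
illustration; they are not taken from [6].

```python
import math

def comb_max(scores):
    """CombMAX: a blog is scored by its best scoring posting."""
    return max(scores)

def exp_comb_sum(scores):
    """expCombSUM: sum of the exponentials of the postings' relevance scores."""
    return sum(math.exp(s) for s in scores)

# Hypothetical relevance scores of the ranked postings of two blogs.
blog_a = [0.9, 0.1]
blog_b = [0.6, 0.6, 0.6]

print(comb_max(blog_a) > comb_max(blog_b))          # CombMAX prefers blog_a
print(exp_comb_sum(blog_b) > exp_comb_sum(blog_a))  # expCombSUM prefers blog_b
```

The two methods can therefore order the same pair of blogs differently: CombMAX
looks only at the single best posting, while expCombSUM also rewards the number
of ranked postings.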

In thread retrieval, [5] reported that scoring threads using the maximum score
of their ranked messages is superior to using the arithmetic mean of the scores,
and that PCS is statistically superior to both methods. Note that the arithmetic
mean is affected by messages with low scores, whereas the maximum
score method favors threads with highly ranked messages. In addition,
CombMAX is a special case of PCS (k = 1) [5]. In other words, focusing on highly
ranked messages improves the retrieval performance. Therefore, in this study, we
hypothesize that voting methods can achieve better performance by focusing on
only the top k ranked messages.

2 Related work 

The voting approach to the blog distillation task is inspired by works on data 
fusion [T2I5] and expert finding [7]. A data fusion technique aims to combine
several ranked lists of documents generated by different retrieval methods into 
a unified list pQ. In addition, retrieval methods that retrieved a document are 
voters for that document. Then, a data fusion technique is used to fuse these 
votes. Data fusion techniques are categorized into score based and rank based 
aggregation methods. The score based methods, such as CombMAX [12], use
the relevance scores of documents, whereas the rank based methods, such as 
BordaFuse PQ, utilize the ranking positions of these documents. 
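
The difference between score based and rank based fusion can be sketched in
Python as follows. This is a simplified illustration under our own assumptions:
each run is a list of (document, score) pairs sorted by decreasing score, and
the Borda variant shown gives a document at 1-based rank r in a run of n
documents n - r points. The function names and example runs are hypothetical.

```python
from collections import defaultdict

def comb_max_fusion(runs):
    """Score based: a document's fused score is its maximum score over all runs."""
    fused = {}
    for run in runs:
        for doc, score in run:
            fused[doc] = max(fused.get(doc, float("-inf")), score)
    return sorted(fused, key=lambda d: (-fused[d], d))

def borda_fusion(runs):
    """Rank based: each run awards n - r points to the document at rank r."""
    fused = defaultdict(int)
    for run in runs:
        n = len(run)
        for rank, (doc, _score) in enumerate(run, start=1):
            fused[doc] += n - rank
    return sorted(fused, key=lambda d: (-fused[d], d))

# Two hypothetical runs produced by different retrieval methods.
run_a = [("d1", 0.9), ("d2", 0.5), ("d3", 0.4)]
run_b = [("d2", 0.6), ("d3", 0.55), ("d1", 0.1)]

print(comb_max_fusion([run_a, run_b]))
print(borda_fusion([run_a, run_b]))
```

In this example the score based method places d1 first because of its single
high score, whereas the rank based method places d2 first because it is ranked
consistently well in both runs.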

The expert finding task is defined as retrieving a list of people who are experts
in the topic of the user query [7]. To estimate the expertise of a person, methods
in this task leverage the documents that are associated with or written by that
person [7]. [7] models the problem of expert finding as a data fusion problem.
Motivated by the success of the voting model in the expert finding task [7], [6]
models blog site retrieval as a voting process as well. The connection between
these tasks is that, in both, we first rank a list of documents with respect to a
user query using an underlying text retrieval model; then each ranked document
is considered as a vote supporting the relevance of its associated "object":
the person or the blog. Indeed, in both tasks, the voting approach was found
to be statistically superior to baseline methods [7, 6]. Similarly, in the Pseudo
Cluster Selection (PCS) method [11], we first rank a list of postings, and then an
aggregation method is applied. However, in PCS, only the top k ranked postings
from each blog are used to estimate the blog's relevance. If the number of the
blog's ranked postings is less than k, then padding is supplied by using the
minimum score of all ranked postings.
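
The PCS procedure described above, including the padding step, can be sketched
as follows. This is a minimal illustration rather than the implementation of
[11]: the function names, the input layout (a flat list of (blog, score) pairs)
and the assumption that scores are positive are ours.

```python
import math

def pcs_score(blog_scores, k, min_score):
    """Geometric mean of a blog's top k posting scores, padded with min_score."""
    top = sorted(blog_scores, reverse=True)[:k]
    padded = top + [min_score] * (k - len(top))
    return math.prod(padded) ** (1.0 / k)

def rank_blogs(postings, k):
    """postings: list of (blog_id, score) pairs for all ranked postings."""
    # Padding value: the minimum score among all ranked postings.
    min_score = min(score for _blog, score in postings)
    by_blog = {}
    for blog, score in postings:
        by_blog.setdefault(blog, []).append(score)
    scored = {b: pcs_score(s, k, min_score) for b, s in by_blog.items()}
    return sorted(scored, key=lambda b: (-scored[b], b))

# Hypothetical ranked postings from three blogs.
postings = [("a", 0.5), ("b", 0.4), ("b", 0.9), ("c", 0.02)]
print(rank_blogs(postings, k=2))
```

Blog "a" has a single ranked posting, so with k = 2 its score is the geometric
mean of 0.5 and the padding value 0.02, which penalizes blogs with fewer than
k ranked postings.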

In thread retrieval, a similar approach has been applied as well [5, 3, 10]; that is,
ranking threads by aggregating their message relevance scores. [5] proposed two
strategies to rank threads: inclusive and selective. The inclusive strategy utilizes
evidence from all messages in order to rank parent threads. Two models from
previous work on blog site retrieval [1] were adapted to thread search: the large
document and the small document models. The large document model creates
a virtual document for each thread by concatenating the thread's message text;
it then scores threads based on their virtual document's relevance to the query.
In contrast, the small document model defines a thread as a collection of text
units (messages). It then scores threads by adding up their messages' similarity
scores. In contrast to the inclusive strategy, [5]'s selective strategy treats threads
as a collection of messages and uses only a few messages to rank threads. Three
selective methods were used. The first scores threads using only the
initial message's relevance score. The second scores threads by taking the
maximum of their message relevance scores. The third method is based on
PCS. Generally, it was found that the selective models are statistically superior
to the inclusive models [5, 3], especially the PCS method. Our work extends the
Pseudo Cluster Selection method by investigating more aggregation methods.

Another line of research is the multiple context retrieval approach [10]. This
approach treats a thread as a collection of several "local contexts", which are
types of self-contained text units. Four contexts were proposed: posts, pairs,
dialogues and the entire thread. The thread and post contexts are identical to
[5]'s virtual document and message concepts respectively. The pair and the
dialogue contexts exploit the conversational relationship between messages to
build text units. To rank threads using the post, pair and dialogue contexts,
PCS was used. It was observed that retrieval using the dialogue context
outperformed retrieval using other contexts. Additionally, the weighted product
between the thread context and the post, pair or the dialogue contexts achieved
better performance than using individual contexts. In our work, we focus on how
to combine the ranked contexts' scores, using the message context. Therefore,
our work is complementary to [5]'s work.

The third line of work is structure based document retrieval [2]. In this
approach, a thread consists of a collection of structural components: the title,
the initial message and the set of reply messages. In this representation, the
thread's relevance to the user query is estimated using [8]'s inference network
framework. Our work can be applied to [5]'s representation as well. We could use
[2]'s inference based relevance score the same way the thread context score was
used in [10].

3 Combined Models

In this work, given a query Q = {q_1, q_2, ..., q_n}, we first rank a list of messages
R_Q with respect to Q. Then, we score threads by aggregating their top k ranked
messages' scores or ranks. [10] stated that the pair and the dialogue contexts
must be extracted using thread discovery techniques, and that retrieval using
inaccurately extracted contexts will hurt the performance significantly. As a
result, we use only the message context.
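
The first step, selecting at most k top ranked messages per thread from the
ranked list, can be sketched as follows. The function name and the tuple layout
(message, thread, score) are assumptions made for this illustration; the list
is assumed to be sorted by decreasing score.

```python
from collections import defaultdict

def top_k_per_thread(ranked_messages, k):
    """Return {thread_id: [(rank, score), ...]} with at most k entries per thread.

    ranked_messages: list of (message_id, thread_id, score) sorted by
    decreasing score; rank is the 1-based position in that list.
    """
    per_thread = defaultdict(list)
    for rank, (_msg, thread, score) in enumerate(ranked_messages, start=1):
        # The list is sorted, so the first k messages seen are the top k.
        if len(per_thread[thread]) < k:
            per_thread[thread].append((rank, score))
    return dict(per_thread)

# Hypothetical ranked list of four messages from two threads.
ranked = [("m1", "t1", 0.9), ("m2", "t2", 0.8), ("m3", "t1", 0.7), ("m4", "t1", 0.6)]
print(top_k_per_thread(ranked, k=2))
```

Thread t2 has only one ranked message, so it contributes a single entry even
though k = 2; no padding is applied in this sketch.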

In estimating the relevance between Q and a message M, we employ the query
language model assuming term independence, uniform probability distribution
for M and Dirichlet smoothing [13] as follows:

\[
P(Q|M) = \prod_{q \in Q} \left( \frac{n(q, M) + \mu P(q|C)}{|M| + \mu} \right)^{n(q, Q)} \tag{1}
\]

where q is a query term and μ is the smoothing parameter. n(q, M) and n(q, Q) are
the term frequencies of q in M and Q respectively, |M| is the number of tokens
in M and P(q|C) is the collection language model.
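
The Dirichlet-smoothed query likelihood described above can be sketched in
Python as follows. The sketch works in log space for numerical stability, which
is our choice and not something stated in the text. The function name, the
dictionary holding P(q|C) and the default value of mu are assumptions; every
query term is assumed to have a non-zero collection probability.

```python
import math
from collections import Counter

def log_query_likelihood(query_terms, message_terms, collection_prob, mu=2000.0):
    """log P(Q|M) under a Dirichlet-smoothed unigram language model.

    collection_prob maps each term q to P(q|C).
    """
    q_counts = Counter(query_terms)      # n(q, Q)
    m_counts = Counter(message_terms)    # n(q, M)
    m_len = len(message_terms)           # |M|
    log_p = 0.0
    for q, n_q in q_counts.items():
        p = (m_counts[q] + mu * collection_prob[q]) / (m_len + mu)
        log_p += n_q * math.log(p)
    return log_p

# Toy example: one query term, a two-token message, P(a|C) = 0.5, mu = 2.
print(math.exp(log_query_likelihood(["a"], ["a", "b"], {"a": 0.5}, mu=2.0)))
```

Because the logarithm is monotonic, ranking messages by log P(Q|M) gives the
same order as ranking by P(Q|M); aggregation methods that operate on the
probabilities themselves would need the exponentiated value.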

To rank threads, the twelve aggregation methods proposed by [7] are adapted:
Votes, Reciprocal Rank (RR), BordaFuse, CombMIN, CombMAX, CombMED,
CombSUM, CombANZ, CombMNZ, expCombSUM, expCombANZ and
expCombMNZ. In addition to these methods, this study uses "CombGNZ", the
geometric mean of the relevance scores. We use this method because it is the
aggregation method employed by the Pseudo Cluster Selection method [11].

Note that our combined models aggregate only the top k ranked messages
from each thread in order to infer a thread's relevance score. Let R_T denote the
set of all ranked messages from a thread T, and R_{T,k} denote the set of the top
k ranked messages. If the size of R_T is less than k, then the set with fewer
members, either R_T or R_{T,k}, is used; we denote it by R_{T,L}. Using R_{T,L},
we can score threads using any of the following methods:

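\begin{align}
\mathrm{Votes}_k(Q,T) &= |R_{T,L}| \tag{2} \\
\mathrm{RR}_k(Q,T) &= \sum_{M \in R_{T,L}} \frac{1}{\mathrm{rank}(Q,M)} \tag{3} \\
\mathrm{BordaFuse}_k(Q,T) &= \sum_{M \in R_{T,L}} \left( |R_Q| - \mathrm{rank}(Q,M) \right) \tag{4} \\
\mathrm{CombMIN}_k(Q,T) &= \min_{M \in R_{T,L}} P(Q|M) \tag{5} \\
\mathrm{CombMED}_k(Q,T) &= \operatorname{median}_{M \in R_{T,L}} P(Q|M) \tag{6} \\
\mathrm{CombSUM}_k(Q,T) &= \sum_{M \in R_{T,L}} P(Q|M) \tag{7} \\
\mathrm{CombANZ}_k(Q,T) &= \frac{1}{|R_{T,L}|} \times \sum_{M \in R_{T,L}} P(Q|M) \tag{8} \\
\mathrm{CombGNZ}_k(Q,T) &= \left( \prod_{M \in R_{T,L}} P(Q|M) \right)^{1/|R_{T,L}|} \tag{9} \\
\mathrm{CombMNZ}_k(Q,T) &= |R_{T,L}| \times \sum_{M \in R_{T,L}} P(Q|M) \tag{10} \\
\mathrm{expCombSUM}_k(Q,T) &= \sum_{M \in R_{T,L}} \exp(P(Q|M)) \tag{11} \\
\mathrm{expCombANZ}_k(Q,T) &= \frac{1}{|R_{T,L}|} \times \sum_{M \in R_{T,L}} \exp(P(Q|M)) \tag{12} \\
\mathrm{expCombMNZ}_k(Q,T) &= |R_{T,L}| \times \sum_{M \in R_{T,L}} \exp(P(Q|M)) \tag{13}
\end{align}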

where rank(Q, M) is the rank of the message M in R_Q, |R_Q| is the size of
R_Q and |R_{T,L}| is the size of R_{T,L}. The rest of this paper reports the retrieval
performance using these methods.
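
The aggregation methods listed above can be sketched in a single Python
function. This is an illustrative sketch under our own assumptions: the
selected messages of a thread are given as a non-empty list of (rank, score)
pairs, ranks are 1-based positions in the ranked list, scores are positive, and
the function and argument names are hypothetical.

```python
import math
from statistics import median

def score_thread(method, top, rq_size):
    """Score one thread from its selected messages.

    top: non-empty list of (rank, score) pairs; rq_size: size of the ranked list.
    """
    ranks = [r for r, _ in top]
    scores = [s for _, s in top]
    n = len(top)
    exp_sum = sum(math.exp(s) for s in scores)
    table = {
        "Votes": n,
        "RR": sum(1.0 / r for r in ranks),
        "BordaFuse": sum(rq_size - r for r in ranks),
        "CombMIN": min(scores),
        "CombMED": median(scores),
        "CombSUM": sum(scores),
        "CombANZ": sum(scores) / n,
        "CombGNZ": math.prod(scores) ** (1.0 / n),  # geometric mean
        "CombMNZ": n * sum(scores),
        "expCombSUM": exp_sum,
        "expCombANZ": exp_sum / n,
        "expCombMNZ": n * exp_sum,
    }
    return table[method]

# Hypothetical thread whose two selected messages sit at ranks 1 and 4.
top = [(1, 0.5), (4, 0.5)]
print(score_thread("BordaFuse", top, rq_size=10))
print(score_thread("CombMNZ", top, rq_size=10))
```

For simplicity the sketch computes every score and then looks up the requested
one; a production version would compute only the method that is needed.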



4 Experimental Design 

Thread retrieval is a new task, and the number of test collections is limited. In
this study, we used the same corpus used by [2]. It has two datasets from two
forums: the Ubuntu¹ and Travel² forums. The statistics of the corpus are as follows.
In the Ubuntu dataset, there are 113277 threads, 676777 messages, 25 queries
and 4512 judged threads. In the Travel dataset, there are 83072 threads, 590021
messages, 25 queries and 4478 judged threads. The same relevance protocol was
followed: a thread with a relevance judgement of 1 or 2 is considered relevant,
while a relevance of 0 is considered irrelevant. Text was stemmed with the Porter
stemmer and no stopword removal was applied. In conducting the experiments,
we used the Indri retrieval system³.

As for evaluation, we calculated Precision at 10 (P@10), Normalized Discounted
Cumulative Gain at 10 (NDCG@10) and Mean Average Precision (MAP).
In addition, we used the virtual document model (VD) [5] as our baseline. We used
VD because it has been used as a strong baseline in most previous studies [5, 10].
In addition to VD, we used the basic aggregation methods. In the basic
methods, all ranked messages are included in the aggregation process. That enables
us to make a fair judgement about the performance of the combined methods.
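
For reference, the three evaluation measures can be sketched for a single query
as follows. The sketch assumes graded judgements stored in a dictionary (0, 1
or 2 per thread, with unjudged threads treated as 0) and uses the linear-gain
form of DCG; other gain functions exist and the text does not state which one
was used. MAP is the mean of the average precision over all queries. All names
are hypothetical.

```python
import math

def precision_at_k(ranking, rels, k=10):
    """Fraction of the top k threads that are relevant (grade > 0)."""
    return sum(1 for t in ranking[:k] if rels.get(t, 0) > 0) / k

def average_precision(ranking, rels):
    """Average of the precision values at the ranks of relevant threads."""
    n_rel = sum(1 for g in rels.values() if g > 0)
    if n_rel == 0:
        return 0.0
    hits, total = 0, 0.0
    for i, t in enumerate(ranking, start=1):
        if rels.get(t, 0) > 0:
            hits += 1
            total += hits / i
    return total / n_rel

def ndcg_at_k(ranking, rels, k=10):
    """DCG of the ranking divided by the DCG of an ideal ordering."""
    def dcg(gains):
        return sum(g / math.log2(i + 1) for i, g in enumerate(gains, start=1))
    ideal = dcg(sorted(rels.values(), reverse=True)[:k])
    return dcg([rels.get(t, 0) for t in ranking[:k]]) / ideal if ideal > 0 else 0.0

# Hypothetical ranking and judgements for one query.
ranking = ["t1", "t2", "t3"]
rels = {"t1": 2, "t3": 1, "t4": 1}
print(precision_at_k(ranking, rels), average_precision(ranking, rels), ndcg_at_k(ranking, rels))
```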

1 ubuntuforums.org
2 http://www.tripadvisor.com/ShowForum-g28953-i4-New_York.html
3 http://www.lemurproject.org/indri.php

As for parameter estimation, we have three parameters to estimate: the
smoothing parameter μ for the virtual document and message language models,
the size of the initial ranked list of messages R_Q and the value of k. To estimate
μ, we varied its value from 500 up to 4000, adding 500 in each run. To estimate
the size of R_Q, we varied its value from 500 up to 5000, adding 500 in each run.
To estimate k, the number of top ranked messages, we varied its value from
2 up to 6, adding 1 in each run. Then, an exhaustive grid search was applied to
maximize MAP using 5-fold cross validation.
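
One way to organize such a grid search with cross validation is sketched below.
The sketch is an assumption about the procedure, not a description of the
actual experimental code: `evaluate` is a hypothetical callback that returns
the MAP of a parameter setting on a list of queries, folds are formed by simple
striding, and the reported value is the mean of the per-fold test scores.

```python
from itertools import product

MU_VALUES = range(500, 4001, 500)   # smoothing parameter
RQ_SIZES = range(500, 5001, 500)    # size of the initial ranked list
K_VALUES = range(2, 7)              # number of top ranked messages

def make_folds(queries, n_folds=5):
    """Split the queries into n_folds disjoint folds by striding."""
    return [queries[i::n_folds] for i in range(n_folds)]

def cross_validated_map(queries, evaluate, n_folds=5):
    """Tune on the training folds, test on the held-out fold, average the results."""
    folds = make_folds(queries, n_folds)
    grid = list(product(MU_VALUES, RQ_SIZES, K_VALUES))
    test_scores = []
    for i, test_fold in enumerate(folds):
        train = [q for j, fold in enumerate(folds) if j != i for q in fold]
        best = max(grid, key=lambda params: evaluate(params, train))
        test_scores.append(evaluate(best, test_fold))
    return sum(test_scores) / len(test_scores)

def toy_evaluate(params, queries):
    """Stand-in for a real MAP computation: peaks at k = 3."""
    _mu, _rq_size, k = params
    return 1.0 - abs(k - 3) / 10.0

print(cross_validated_map(list(range(25)), toy_evaluate))
```

With the value ranges given in the text, the grid contains 8 x 10 x 5 = 400
parameter settings per training fold.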

Table 1. Retrieval performance of fusing the top k messages on the Ubuntu dataset.

5 Result and Discussion

This section reports the result of fusing only the top k messages. Table 1 and
Table 2 present the retrieval performance of these methods on the Ubuntu dataset
and the Travel dataset respectively. The symbols △ and ▽ denote statistically
significant improvements or degradations over the virtual document model (VD)
respectively, whereas ▲ and ▼ denote statistically significant improvements or
degradations of a top k model over its basic model, e.g. CombSUM_k over
CombSUM. All significance tests are conducted using the t-test at p < 0.05. The
upper parts of the tables contain the retrieval performance of the virtual
document (VD) and the basic aggregation methods, while the lower parts contain
the performance of the combined methods.

Generally, the data in these tables support the results of previous
research. In the basic mode, BordaFuse, CombSUM, CombMNZ, expCombSUM
and expCombMNZ were able to produce better or comparable
results with respect to VD. These methods favor threads with highly ranked
messages. In contrast, CombGNZ, CombMED, CombANZ, CombMIN and
expCombANZ might be affected by threads that have a lot of low scored messages.
In the combined models, all rank based methods have results almost identical to
their results in the basic mode. In contrast, the score based methods benefit
largely from fusing only the top k messages. Among them, methods
favouring threads with highly ranked messages brought significant improvements
over baseline methods; see the performance of CombSUM, CombMNZ,
expCombSUM and expCombMNZ. In addition, CombANZ, CombMIN, CombMED and
expCombANZ benefit as well. Therefore, focusing on the top k messages is a good
strategy to improve data fusion performance on thread retrieval.

Discussion 

The improvement might be due to the removal of noise introduced when all
ranked messages are considered. By focusing on the top k messages, low scoring
messages are discarded, which improves retrieval. In fact, ranking using
only one message, e.g. CombMAX, might not be enough to capture the topical
relevance of threads, while considering all ranked messages might introduce
irrelevant messages. Therefore, focusing on the top k ranked messages balances
these two aspects. This premise explains, to some extent, why no improvements
were observed among rank based methods: rank based methods inherently
address the issue of low scoring messages. Furthermore, the nearly identical
performance of CombSUM, CombMNZ and their exponential variants suggests that,
once we focus on the top k messages, extra emphasis on these messages does not
help. In other words, focusing on the top ranked messages is the main reason
behind the observed improvements.
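
The noise argument can be made concrete with a small worked example. The scores
below are invented for illustration and are not taken from the experiments; the
function names are hypothetical. A long thread with two strong matches and
three weak ones is compared with a short thread of two good matches, using the
arithmetic mean over all messages and over the top 2 messages only.

```python
def comb_anz(scores):
    """Arithmetic mean over all ranked messages of a thread."""
    return sum(scores) / len(scores)

def comb_anz_top_k(scores, k):
    """Arithmetic mean over the thread's k best scoring messages only."""
    top = sorted(scores, reverse=True)[:k]
    return sum(top) / len(top)

# Invented message scores for two threads.
focused = [0.8, 0.7]
noisy = [0.9, 0.85, 0.1, 0.05, 0.05]

print(comb_anz(focused), comb_anz(noisy))
print(comb_anz_top_k(focused, 2), comb_anz_top_k(noisy, 2))
```

Over all messages the noisy thread averages about 0.39 against 0.75 for the
focused thread, but over the top 2 messages it averages 0.875 and is ranked
first. The low scoring messages, not the best ones, decide the outcome when
every ranked message is aggregated.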

To confirm this hypothesis, a study was conducted to investigate how
the performance changes as k increases. As Figures 1 and 2 show, the optimal
value of k for each method ranges between 2 and 5 messages. Going beyond the
optimal value, rank based methods tend to have consistent performance that
is similar to their results in the basic mode. In contrast, score based methods
performed badly as k increases. That further supports the proposed
hypothesis. Another interesting observation is the performance of CombGNZ. The
result shows the inferiority of CombGNZ to the aforementioned good performing
methods. However, [5, 3] reported that PCS, using the geometric mean,
outperformed CombMAX. That does not contradict our findings. In [5, 3], PCS adds
a padding step if the number of ranked messages is less than k, whereas we do
not apply this step. We plan to study the effect of the padding step in future
work.

Table 2. Retrieval performance of fusing the top k messages on the Travel dataset.

Fig. 1. The performance of the combined methods as k changes on the Ubuntu dataset.

Fig. 2. The performance of the combined methods as k changes on the Travel dataset.



6 Conclusion and Future Works 

In this paper, we addressed the problem of online forum thread retrieval. We
conducted experiments to investigate the performance of combining the pseudo
cluster selection and voting techniques approaches. The results showed that the
combined models can improve the performance of score based voting techniques
significantly in various measures.

Our future work has two directions. First, we will apply the combined
methods as aggregation methods on the other thread representations, such as fusing
the dialogue or the pair relevance scores [10]. Second, we will study the effect of
the padding step on the performance of the combined models.